ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Author information

SJTU, Minyi Guo's lab, group of co-advisor Jieru Zhao (Jieru Zhao's Homepage)

Link:

[2412.03213] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Abstract

Large Language Models (LLMs) have been widely deployed in a variety of applications, and the context length is rapidly increasing to handle tasks such as long-document QA and complex logical reasoning. However, long context poses significant challenges for inference efficiency, including high memory costs of key-value (KV) cache and increased latency due to extensive memory accesses. Recent works have proposed compressing KV cache to approximate computation, but these methods either evict tokens permanently, never recalling them for later inference, or recall previous tokens at the granularity of pages divided by textual positions. Both approaches degrade the model accuracy and output quality. To achieve efficient and accurate recallable KV cache compression, we introduce ClusterKV, which recalls tokens at the granularity of semantic clusters. [Note: recall is done from the perspective of semantic clusters.] We design and implement efficient algorithms and systems for clustering, selection, indexing and caching. Experiment results show that ClusterKV attains negligible accuracy loss across various tasks with 32k context lengths, using only a 1k to 2k KV cache budget, and achieves up to a 2× speedup in latency and a 2.5× improvement in decoding throughput. Compared to SoTA recallable KV compression methods, ClusterKV demonstrates higher model accuracy and output quality, while maintaining or exceeding inference efficiency.

One-sentence summary

Sparse attention via semantic clusters.

Motivation

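  1. Sparsity of attention computation: only a subset of the KV entries needs to take part in the attention computation with Q.

  2. In semantic space, tokens whose key vectors lie close to each other receive similar attention weights and therefore produce similar attention results.

  3. Using cosine similarity as the distance between vectors captures the semantic space better.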

image-20241227153850766

  4. Attention sinks
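The motivation points above can be illustrated with a toy example. The sketch below is not taken from the paper: the vectors, the helper names (`cosine_similarity`, `attention`) and the `keep` parameter are all made up for illustration. It shows that two keys with high cosine similarity get similar attention weights, and that attending only to the few keys aligned with the query gives almost the same output as full attention.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values, keep=None):
    """Single-query attention. If `keep` is given, only those token indices are used."""
    idx = list(range(len(keys))) if keep is None else list(keep)
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, keys[i]) / scale for i in idx])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) for d in range(dim)]

# Toy data (hypothetical): keys 0 and 1 point in almost the same direction as q.
q = [1.0, 0.0]
keys = [[8.0, 0.0], [7.8, 0.4], [-8.0, 0.0], [0.0, -8.0]]
values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

full = attention(q, keys, values)                 # uses all four tokens
sparse = attention(q, keys, values, keep=[0, 1])  # uses only the two aligned tokens
print(cosine_similarity(keys[0], keys[1]), full, sparse)
```

Here the two tokens that dominate the softmax are exactly the ones whose keys are close to each other and to the query, which is the property that makes grouping keys by direction attractive.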

Innovations or contributions

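Cluster Attention: in semantic space, the closer the keys k of a token cluster are to the query q, the more that cluster matters in the attention computation.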

image-20241227150548932

image-20241227151611616
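A minimal sketch of how cluster-granularity selection could look, assuming each cluster is summarized by a centroid and clusters are ranked by the dot product between the query and the centroid. The function name `select_tokens_by_cluster`, the scoring rule and the truncation of the last cluster to fit the budget are illustrative assumptions, not the paper's implementation.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_tokens_by_cluster(q, centroids, labels, budget):
    """Pick token indices from the clusters whose centroids score highest against q.

    centroids: one centroid vector per cluster
    labels:    labels[i] is the cluster id of token i
    budget:    maximum number of token indices to return
    """
    # Rank cluster ids by query-centroid score, best first (assumed scoring rule).
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(q, centroids[c]),
                    reverse=True)
    selected = []
    for c in ranked:
        room = budget - len(selected)
        if room <= 0:
            break
        members = [i for i, lab in enumerate(labels) if lab == c]
        selected.extend(members[:room])  # truncate the last cluster to fit the budget
    return sorted(selected)

# Hypothetical example: three clusters over six tokens.
q = [1.0, 0.0]
centroids = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
labels = [0, 1, 1, 2, 0, 1]
print(select_tokens_by_cluster(q, centroids, labels, budget=4))
```

Only the centroids are scored against the query, so the selection cost grows with the number of clusters rather than with the number of cached tokens.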

Detailed design

  1. Run K-means on the key vectors, setting aside 16 attention-sink tokens as outliers. The paper also stresses here that only generated tokens are clustered, while prefill tokens are not.
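The clustering step can be sketched as a small K-means over key vectors that uses cosine similarity for assignment and leaves the first 16 tokens out as attention sinks. This is an illustrative sketch only: the function name `cluster_keys`, the evenly spaced initialization, the fixed iteration count and the choice to treat the first 16 positions as the sinks are assumptions, and which tokens are passed in is left to the caller.

```python
import math

NUM_SINKS = 16  # from the notes: 16 attention-sink tokens are kept out of clustering

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cluster_keys(keys, num_clusters, num_sinks=NUM_SINKS, iters=10):
    """Cluster key vectors by direction, skipping the first `num_sinks` tokens.

    Returns (sink_ids, centroids, labels), where labels maps token index -> cluster id.
    """
    sink_ids = list(range(min(num_sinks, len(keys))))
    rest_ids = list(range(len(sink_ids), len(keys)))
    if not rest_ids:
        return sink_ids, [], {}
    # On unit vectors, the dot product equals cosine similarity.
    unit = {i: normalize(keys[i]) for i in rest_ids}
    k = min(num_clusters, len(rest_ids))
    step = len(rest_ids) / k
    # Deterministic init (assumption): evenly spaced tokens as starting centroids.
    centroids = [unit[rest_ids[int(j * step)]] for j in range(k)]
    labels = {}
    for _ in range(iters):
        # Assignment step: each key goes to the most similar centroid.
        for i in rest_ids:
            labels[i] = max(range(k), key=lambda c: dot(unit[i], centroids[c]))
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [unit[i] for i in rest_ids if labels[i] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return sink_ids, centroids, labels

# Hypothetical data: 16 sink keys, then two groups pointing along different axes.
keys = ([[1.0, 1.0]] * 16
        + [[2.0, 0.1], [3.0, -0.1], [1.0, 0.05], [5.0, 0.2]]
        + [[0.1, 2.0], [-0.1, 3.0], [0.05, 1.0], [0.2, 5.0]])
sink_ids, centroids, labels = cluster_keys(keys, num_clusters=2)
print(sink_ids, labels)
```

Because assignment is by direction rather than Euclidean distance, keys of very different magnitude (for example `[1.0, 0.05]` and `[5.0, 0.2]`) still land in the same cluster.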

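Experimental evaluation

Background

Overview of problems in prior work

Challenges

Supplementary background

Perspectives for reflection

How would I approach this problem?

Can this insight lead to other methods?

Can this insight be transferred to other fields?

What could be improved in this work?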

Q&A
